SUMMARY

The paper briefly describes the self-growing neural network algorithm, CID3, which makes decision 
trees equivalent to hidden layers of a neural network. The algorithm generates a feedforward architecture 
using crisp and fuzzy entropy measures. The results of a real-life recognition problem of distinguishing 
defects in a glass ribbon and of a benchmark problem of differentiating two spirals are shown and 
discussed. 


INTRODUCTION 

Supervised neural network algorithms usually require a trial-and-error method
to find proper architectures. To help in determining a feedforward neural network architecture, it was
shown [1] that a four-layered network with two hidden layers can solve arbitrary classification problems.
Irie and Miyake [2] proved that a three-layered backpropagation network with an infinite number of
nodes in the hidden layer can also solve arbitrary mapping problems. The “tiling” algorithm of Nadal
[3] generates a feedforward network in a sequential manner by adding nodes and layers without the need
for guessing the network’s architecture, but it does not specify the sequence in which nodes should be
added to maximize classification of training examples. The algorithm described in [4] uses information
entropy to determine generation of nodes and hidden layers.

Information entropy, however, has been used for a long time in machine learning research, where
numerous learning algorithms have been developed to solve pattern recognition problems. A machine
learning algorithm of particular interest to us is the ID3 algorithm of Quinlan [5], which dynamically
generates a decision tree while minimizing information entropy. Recent studies of the ID3 algorithm and
backpropagation neural networks [6, 7] suggested that ideas similar to those of the ID3 algorithm may be used
to answer two fundamental questions concerning a neural network architecture, namely, how to decide the
number of layers and the number of nodes per layer. One goal of this paper is to show the close relationship
between inductive machine learning and feedforward neural networks. This will be done by introducing
the main ideas of a Continuous ID3 (CID3) algorithm [8].

Machine learning and neural networks are two very closely related fields of artificial intelligence,
sharing many common ideas and problems. Effective methods of one can be used to overcome the
difficulties of the other. It is interesting to note that the starting point in the development of the CID3
algorithm was machine learning [9]. The advantage of using a machine learning approach to
generate a feedforward neural network is that the knowledge embedded in the connections and weights
can be translated into decision rules. To achieve fast convergence, a learning rule using information
entropy was combined with Cauchy training [8, 10, 11] in the CID3 algorithm.

In order to make the main ideas of the CID3 algorithm clear, it is necessary to briefly introduce
Quinlan’s ID3 algorithm first [5]. ID3 generates decision rules from a set of training examples. Each
example is represented by a list of features. In the training process, class memberships of the input data
must be known. The idea is to find the minimum number of original features that suffice in determining
class memberships. ID3 uses information theory to select features which give the greatest information
gain or decrease of entropy. Entropy is defined as $-p \log_{2} p$, where probability $p$ is determined from the
frequency of occurrence. Since the number of new nodes added to a decision tree depends on the number
of values that a selected feature can take on, the ID3 algorithm requires features to have discrete values.
The generated decision tree is then described in terms of hierarchical decision rules which must be used in
the order specified by the tree structure. The condition part of a decision rule consists of a number of
feature tests linked by and/or logical operators. The drawback of a feature test is that the correlations
between features are ignored. ID3 considers only how significant an individual feature is for classifying
training examples. The next section shows how an Adaline (adaptive linear neuron) [12] can be used for
knowledge representation.
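The feature-selection criterion of ID3 described above can be sketched as follows. This is a minimal illustration only, not Quinlan's implementation; the toy examples, the feature names `shape` and `size`, and the function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(examples, labels, feature):
    """Entropy decrease obtained by splitting on one discrete feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(ex[feature] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels) if ex[feature] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical toy data: two discrete features, two classes.
examples = [{"shape": "round", "size": "small"},
            {"shape": "round", "size": "large"},
            {"shape": "square", "size": "small"},
            {"shape": "square", "size": "large"}]
labels = ["+", "+", "-", "-"]

# ID3 would test the feature with the greatest information gain first.
best = max(["shape", "size"], key=lambda f: information_gain(examples, labels, f))
```

In this toy set the class depends only on `shape`, so that feature carries all of the gain and `size` carries none.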


EQUIVALENCE OF A DECISION TREE AND A HIDDEN LAYER 

As stated above, the basic idea of the ID3 algorithm is to detect a feature yielding maximum
information gain, so that the training examples can be correctly classified. Let us consider a problem [8] of
distinguishing nine positive examples from eight negative examples, as depicted in figure 1.

The difficulty of applying the ID3 algorithm, and calculating the corresponding entropy, comes from
the fact that the coordinates $x_{1}$ and $x_{2}$ take on continuous values. In order to apply ID3 to this problem,
one could use thresholds, so that the examples could be located within certain regions [13]. The thresholds
may be represented as vertical and horizontal lines in a two-dimensional space. In real applications,
however, decision regions are usually of higher order than a line, so the approximation may result in
defining many high-dimensional decision regions. However, the decision region boundaries containing
the same nine positive training examples can be formed by using hyperplanes defined by Adalines [12].

It is important to note that the feature test performed by ID3 can be treated as a special case of an Adaline
with its hyperplane parallel to an axis. The decision region covering nine examples can be described by
only three hyperplanes, as shown in figure 2.

To describe a decision region in terms of decision rules, the examples on the positive side of a
hyperplane $i$ (hyp$_{i}$) satisfy “feature$_{i}$ = 1”. Thus, the decision rules can be simply specified as follows:

IF feature$_{1}$ = 0, and feature$_{2}$ = 0, and feature$_{3}$ = 1, THEN class = positive.

IF feature$_{2}$ = 1, and feature$_{3}$ = 0, THEN class = positive.

IF feature$_{1}$ = 1, and feature$_{2}$ = 1, THEN class = positive.

Let us illustrate the conversion of a decision tree into a hidden layer by using the above example.
First, if an example is tested on the positive side of hyp$_{1}$, then that example will be classified along
edge 1, as shown in figure 3; otherwise, it will be classified along edge 0. Starting at the root node a, the
training examples are divided into two nodes, b and c. At the second level of the decision tree, the
examples from nodes b and c are tested against hyp$_{2}$. The examples on the positive side of hyp$_{2}$ will
be classified along edge 1 to a node descending from their parent node. Correspondingly, training
examples on the negative side will be classified along edge 0 to the node descending from their parent
node. The third hyperplane is needed to divide the examples at nodes f and g.

To convert the decision tree shown on the right-hand side of figure 3 into a hidden layer of a
neural network, three Adalines are utilized. The directional vector of a hyperplane corresponds to the
weight vector of an Adaline. For hyp$_{1}$, the weights $w_{1}$ and $w_{2}$ are the connection strengths of inputs $x_{1}$
and $x_{2}$ to Adaline #1 (neuron #1). Corresponding to the decision tree, a hidden layer with three nodes is
generated, as shown on the left-hand side of figure 3.
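The correspondence between a hidden layer of Adalines and a path through the decision tree can be sketched as below. The weights and thresholds are hypothetical values chosen for illustration; they are not the hyperplanes of figure 2.

```python
def adaline(weights, bias):
    """Return a threshold unit: 1 on the positive side of the hyperplane, else 0."""
    def unit(x):
        s = sum(w * xi for w, xi in zip(weights, x)) + bias
        return 1 if s > 0 else 0
    return unit

# Hypothetical hyperplanes in the (x1, x2) plane.
hidden_layer = [
    adaline([1.0, 0.0], -2.0),   # axis-parallel: the ID3-style feature test "x1 > 2"
    adaline([0.0, 1.0], -3.0),   # axis-parallel: "x2 > 3"
    adaline([1.0, 1.0], -6.0),   # oblique hyperplane: "x1 + x2 > 6"
]

def tree_path(x):
    """Hidden-layer outputs, read as the edge labels followed from the root of the tree."""
    return tuple(unit(x) for unit in hidden_layer)
```

The first two units show why an ID3 feature test is a special case of an Adaline: a weight vector with a single nonzero component gives a hyperplane parallel to an axis.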


CONTINUOUS ID3 ALGORITHM 

In order to use ID3 for generating a neural network architecture, it was modified [8] to operate on
continuous data and to search for the weight vectors. The following notation is used. There are $N$
training examples, $N^{+}$ examples belonging to class “+” and $N^{-}$ examples belonging to class “-”. A
hyperplane divides the examples as either lying on its positive (1) or negative (0) side. There are four
possible outcomes:

$N_{1}^{+}$  number of examples from class “+” on side 1,

$N_{0}^{+}$  number of examples from class “+” on side 0,

$N_{1}^{-}$  number of examples from class “-” on side 1, and

$N_{0}^{-}$  number of examples from class “-” on side 0.

The following relations hold:

$$N = N^{+} + N^{-} = N_{1}^{+} + N_{0}^{+} + N_{1}^{-} + N_{0}^{-} \tag{1a}$$

$$N_{0}^{+} = N^{+} - N_{1}^{+} \tag{1b}$$

$$N_{0}^{-} = N^{-} - N_{1}^{-} \tag{1c}$$


At a certain level of a decision tree, it is assumed that $N_{r}$ examples were divided by node $r$ into
$N_{r}^{+}$ belonging to class “+” and $N_{r}^{-}$ belonging to class “-”. Relations analogous to those of equation (1)
follow:

$$N_{r} = N_{r}^{+} + N_{r}^{-} = N_{1r}^{+} + N_{0r}^{+} + N_{1r}^{-} + N_{0r}^{-} \tag{2a}$$

$$N_{0r}^{+} = N_{r}^{+} - N_{1r}^{+} \tag{2b}$$

$$N_{0r}^{-} = N_{r}^{-} - N_{1r}^{-} \tag{2c}$$

The information entropy at level $L$ of a decision tree is an average of entropies of all $R$ nodes in this layer:

$$E = \sum_{r=1}^{R} \frac{N_{r}}{N}\,\mathrm{entropy}(L, r) \tag{3}$$
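As a worked check of equation (3), the sketch below recomputes the first two entropies listed with figure 3 from the class counts that appear in the logarithm terms there. The function names are illustrative. The split of the two pure nodes at the second level into (1, 0) and (0, 4) is an assumption that is consistent with the totals; pure nodes contribute zero in any case.

```python
import math

def node_entropy(n_pos, n_neg):
    """Entropy (bits) of one tree node holding n_pos '+' and n_neg '-' examples."""
    total = n_pos + n_neg
    result = 0.0
    for count in (n_pos, n_neg):
        if count > 0:
            p = count / total
            result -= p * math.log2(p)
    return result

def level_entropy(nodes):
    """Equation (3): node entropies weighted by N_r / N; nodes is a list of (n_pos, n_neg)."""
    n = sum(p + q for p, q in nodes)
    return sum((p + q) / n * node_entropy(p, q) for p, q in nodes)

# After the first hyperplane: 17 examples split into nodes of 5 and 12.
level_1 = level_entropy([(1, 4), (8, 4)])
# After the second hyperplane: two pure nodes (assumed split) and two mixed nodes.
level_2 = level_entropy([(1, 0), (0, 4), (1, 2), (7, 2)])
```

Rounded to three decimals, `level_1` and `level_2` reproduce the values 0.861 and 0.567 given with figure 3.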

The change in information entropy is stated as [8]

$$\Delta E = \sum_{r=1}^{R} \left[ \frac{\partial E}{\partial N_{1r}^{+}}\,\Delta N_{1r}^{+} + \frac{\partial E}{\partial N_{1r}^{-}}\,\Delta N_{1r}^{-} \right] \tag{4}$$

The learning rule which minimizes the entropy function is

$$\Delta w_{ij} = -\rho\,\frac{\partial E}{\partial w_{ij}} \tag{5}$$

where $D_{j}$ stands for the desired output of a training example, and $\mathrm{out}_{j}$ is a sigmoid function; $\rho$ is a learning
rate. The learning process for adjusting the weights can be stated in a vector form as follows:

$$W_{k+1} = W_{k} + \Delta W \tag{6}$$

Unfortunately, when the learning rule specified by equation (6) is used, the learning process
might converge to a local minimum, since the gradient method does not guarantee constant information
gain while generating a hidden layer. In order to increase the chance of finding the global minimum, the
learning rule of equation (6) was combined [8] with Cauchy training [10, 11]:

$$\Delta w = T(t)\,\tan\left[\pi\,P(\Delta W < \Delta w) - \frac{\pi}{2}\right] \tag{7}$$

To calculate the size of this weight change, a random number is selected from a uniform distribution
over [0, 1], and substituted for $P(X < x)$. To determine whether to accept the weight change, the Boltzmann
distribution was used [8]. The probability of the error $e$ is calculated by equation (8), where $k$ is the
Boltzmann constant.

$$P(e) = \exp\left(-\frac{e}{kT}\right) \tag{8}$$


The final learning rule, incorporating the concept of Cauchy training, is thus defined by
equation (9), where the random weight vector $\Delta W_{\mathrm{random}}$ is calculated from equation (7), and $\eta$ is a control
parameter.

$$W_{k+1} = W_{k} + (1 - \eta)\,\Delta W + \eta\,\Delta W_{\mathrm{random}} \tag{9}$$
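The pieces of the combined learning rule can be sketched as follows. This is an illustration under stated assumptions, not the implementation of [8]: the acceptance test assumes the usual Boltzmann form exp(-e/kT) and always accepts improvements, `k` is treated as a scaling constant set to 1, and the gradient step is passed in as a plain list because the entropy gradient itself is not reproduced here.

```python
import math
import random

def cauchy_step(temperature, rng):
    """Equation (7): draw one Cauchy-distributed weight change by inverting the CDF."""
    u = rng.random()  # uniform random number substituted for P(dW < dw)
    return temperature * math.tan(math.pi * u - math.pi / 2)

def accept(error_change, temperature, rng, k=1.0):
    """Boltzmann-style acceptance test (assumed form: exp(-e / kT))."""
    if error_change <= 0:          # assumption: improvements are always accepted
        return True
    return rng.random() < math.exp(-error_change / (k * temperature))

def combined_update(weights, gradient_step, temperature, eta, rng):
    """Equation (9): W_{k+1} = W_k + (1 - eta) * dW + eta * dW_random."""
    return [w + (1.0 - eta) * dw + eta * cauchy_step(temperature, rng)
            for w, dw in zip(weights, gradient_step)]

rng = random.Random(0)             # fixed seed, for repeatability only
new_weights = combined_update([0.2, -0.1], [0.05, 0.02],
                              temperature=1.0, eta=0.1, rng=rng)
```

With `eta = 0` the update reduces to the gradient rule of equation (6); larger values of `eta` give the heavy-tailed Cauchy jumps more weight, which is what allows escapes from local minima.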

So far, we have used only crisp entropy. As an alternative, fuzzy entropy can also be used
[14]. A comparison of the performance of the two measures is presented in the Results and Discussion
section.


Let us now briefly introduce the notion of fuzzy entropy. A fuzzy entropy measure is a function 
f: P(X) -> R, where P(X) denotes the set of all fuzzy subsets of X. The function f assigns a value f(A) to 
each fuzzy subset A of X that characterizes the degree of fuzziness of A. It must satisfy the following 
three axioms: 


f(A) = 0 if and only if A is a crisp set. 

If A is less fuzzy than B, then f(A) < f(B). 

f(A) assumes the maximum if and only if A is maximally fuzzy. 

De Luca and Termini [15] were the first to define the fuzzy entropy function:

$$f(A) = -\sum_{x \in X} \left\{ \mu_{A}(x) \log_{2} \mu_{A}(x) + [1 - \mu_{A}(x)] \log_{2} [1 - \mu_{A}(x)] \right\}$$

where $\mu_{A}$ is a membership function.
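A small sketch of the De Luca and Termini measure illustrates the first and third axioms. The function name is illustrative, and the convention that 0 log 0 = 0 is an assumption made so that crisp grades are handled.

```python
import math

def deluca_termini_entropy(memberships):
    """Fuzzy entropy of a finite fuzzy set given as a list of membership grades."""
    def term(m):
        # Convention assumed here: 0 * log2(0) = 0, so crisp grades contribute nothing.
        return 0.0 if m in (0.0, 1.0) else m * math.log2(m)
    return -sum(term(m) + term(1.0 - m) for m in memberships)

crisp = deluca_termini_entropy([0.0, 1.0, 1.0])      # crisp set
fuzziest = deluca_termini_entropy([0.5, 0.5])        # maximally fuzzy set
```

A crisp set gives zero, and each element with grade 0.5 contributes one bit, the largest possible contribution per element.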

Other measures of fuzziness were proposed by Kaufmann [16], Knopfmacher [17], and Loo [18].
The fuzzy entropy measure used here, which was also used in [14], was proposed by Kosko [19, 20]:

$$f(A) = \frac{\Sigma\mathrm{count}(A \cap A^{c})}{\Sigma\mathrm{count}(A \cup A^{c})} \tag{10}$$

where $\Sigma\mathrm{count}$ (sigma-count) is the fuzzy cardinality [21, 22] and $A^{c}$ is Zadeh’s complement [23]. In
the experiments reported in the Results and Discussion section, the fuzzy entropy of equation (10) is used
interchangeably with crisp entropy. The generalized fuzzy operations introduced by Dombi [24] are used
to combine the fuzzy sets [14]. Generalized Dombi’s operations form one of several classes of
functions which possess properties of fuzzy unions and intersections. The operations are detailed as
follows.


Fuzzy union = …          (11)

where $\lambda$ is a parameter by which different unions are distinguished, and $\lambda \in (0, \infty)$.

Fuzzy intersection = …          (12)

where $\lambda$ is a parameter by which different intersections are distinguished, and $\lambda \in (0, \infty)$. Experience tells us
that the parameter $\lambda = 4$ gives the best results [25, 26].
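For orientation, the sketch below uses the two-argument Dombi union and intersection in the form commonly given in the fuzzy-set literature, and combines them with the sigma-count ratio of equation (10). Both the exact generalized form used in [14] and the element-by-element pairing of $A$ with $A^{c}$ are assumptions here; the function names are illustrative.

```python
def dombi_union(a, b, lam=4.0):
    """Dombi union of two membership grades (standard two-argument form, assumed)."""
    if a == 1.0 or b == 1.0:
        return 1.0
    if a == 0.0:
        return b
    if b == 0.0:
        return a
    s = (1.0 / a - 1.0) ** (-lam) + (1.0 / b - 1.0) ** (-lam)
    return 1.0 / (1.0 + s ** (-1.0 / lam))

def dombi_intersection(a, b, lam=4.0):
    """Dombi intersection of two membership grades (standard two-argument form, assumed)."""
    if a == 0.0 or b == 0.0:
        return 0.0
    if a == 1.0:
        return b
    if b == 1.0:
        return a
    s = (1.0 / a - 1.0) ** lam + (1.0 / b - 1.0) ** lam
    return 1.0 / (1.0 + s ** (1.0 / lam))

def kosko_entropy(memberships, lam=4.0):
    """Equation (10): sigma-count of A and A^c intersected, over that of their union."""
    overlap = sum(dombi_intersection(m, 1.0 - m, lam) for m in memberships)
    underlap = sum(dombi_union(m, 1.0 - m, lam) for m in memberships)
    return overlap / underlap
```

The explicit branches for grades 0 and 1 are the limiting values of the formulas and avoid division by zero; a crisp set then has zero overlap with its complement and therefore zero fuzzy entropy.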

Thus, in the learning rule of equation (5), the fuzzy entropy $f(A)$ can be used instead of the crisp entropy $E$:

$$\Delta w_{ij} = -\rho\,\frac{\partial f(A)}{\partial w_{ij}} \tag{13}$$


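where $f(A)$ is a fuzzy entropy function. The grades of membership for fuzzy sets $A$ and $A^{c}$ were defined
as follows [14]:

$$A = \left\{ \frac{N_{0r}^{+}}{N_{0r}},\ \frac{N_{0r}^{-}}{N_{0r}},\ \frac{N_{1r}^{+}}{N_{1r}},\ \frac{N_{1r}^{-}}{N_{1r}} \right\} \tag{14a}$$

$$A^{c} = 1 - A \tag{14b}$$

with $N_{1r} = N_{1r}^{+} + N_{1r}^{-}$ and $N_{0r} = N_{r} - N_{1r}$.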


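Using mutual dependence of positive and negative examples on both sides of a hyperplane, the
resulting fuzzy set $A$ (with four grades of membership), and its fuzzy complement $A^{c}$, were expressed as

$$A = \left\{ \frac{N_{r}^{+} - N_{1r}^{+}}{N_{r} - N_{1r}^{+} - N_{1r}^{-}},\ \frac{N_{r}^{-} - N_{1r}^{-}}{N_{r} - N_{1r}^{+} - N_{1r}^{-}},\ \frac{N_{1r}^{+}}{N_{1r}},\ \frac{N_{1r}^{-}}{N_{1r}} \right\} \tag{15a}$$

$$A^{c} = \left\{ \frac{N_{r}^{-} - N_{1r}^{-}}{N_{r} - N_{1r}^{+} - N_{1r}^{-}},\ \frac{N_{r}^{+} - N_{1r}^{+}}{N_{r} - N_{1r}^{+} - N_{1r}^{-}},\ \frac{N_{1r} - N_{1r}^{+}}{N_{1r}},\ \frac{N_{1r} - N_{1r}^{-}}{N_{1r}} \right\} \tag{15b}$$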
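The four grades of membership can be computed directly from the counts at a node, as in the sketch below. The function and argument names are illustrative, and the example counts (a node of 9 positive and 8 negative examples with 8 and 4 of them on side 1) are chosen for illustration.

```python
def membership_grades(n_r_pos, n_r_neg, n_1r_pos, n_1r_neg):
    """Grades of A and its complement for node r, from the counts of equation (2).

    Assumes both sides of the hyperplane are non-empty (otherwise a division by zero).
    """
    n_r = n_r_pos + n_r_neg
    n_1r = n_1r_pos + n_1r_neg          # examples on side 1
    n_0r = n_r - n_1r                   # examples on side 0
    a = [(n_r_pos - n_1r_pos) / n_0r,   # class "+" on side 0
         (n_r_neg - n_1r_neg) / n_0r,   # class "-" on side 0
         n_1r_pos / n_1r,               # class "+" on side 1
         n_1r_neg / n_1r]               # class "-" on side 1
    a_c = [1.0 - grade for grade in a]  # Zadeh's complement
    return a, a_c

a, a_c = membership_grades(n_r_pos=9, n_r_neg=8, n_1r_pos=8, n_1r_neg=4)
```

On each side of the hyperplane the two grades sum to one, so the complement simply swaps the roles of the two classes on that side.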


The four grades of membership will be used in equations (11) and (12) to calculate the fuzzy
entropy of equation (10). Obtained in this way, fuzzy entropy will be used to calculate the weights for
the learning rule of equation (9). The CID3 algorithm [8] follows:


Step 1. For a given problem with $N$ training examples, follow the notations given in equations
(1) and (2). Start with a random initial weight vector $W_{0}$.

Step 2. Utilize the learning rule of equation (9) and search for a hyperplane that minimizes the following
entropy function (either crisp or fuzzy):

$$\min_{W} \sum_{r=1}^{R} \frac{N_{r}}{N}\,\mathrm{entropy}(L, r)$$

Step 3. If the minimized entropy is not zero, but is smaller than the previous value, add a node to
the current layer and return to step 2. Otherwise, go to step 4.

Step 4. If the hidden layer consists of more than one node, generate a new layer that utilizes
inputs from both the original training data and the outputs from all previously generated
layers, and go to step 2. If the hidden layer consists of only one node, then the
problem is reduced to a linearly separable one; stop.
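The control flow of steps 2 and 3 can be sketched as below. To keep the sketch short and deterministic, the weight search of equation (9) is replaced by a scan over a fixed list of candidate hyperplanes; that substitution, the candidate list, and the toy data are assumptions made for illustration, and step 4 (generation of further layers) is not shown.

```python
import math

def node_entropy(counts):
    """Entropy (bits) of a node given its per-class example counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def level_entropy(paths, labels):
    """Equation (3): examples sharing a path share a tree node."""
    nodes = {}
    for path, label in zip(paths, labels):
        nodes.setdefault(path, [0, 0])[label] += 1
    n = len(labels)
    return sum(sum(c) / n * node_entropy(c) for c in nodes.values())

def side(hyperplane, x):
    """1 if x lies on the positive side of (weights, bias), else 0."""
    weights, bias = hyperplane
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0

def grow_layer(xs, labels, candidates):
    """Add nodes to one hidden layer while the level entropy keeps decreasing."""
    layer = []
    paths = [() for _ in xs]
    previous = level_entropy(paths, labels)
    while previous > 0:
        scored = []
        for h in candidates:  # stand-in for the search with learning rule (9)
            new_paths = [p + (side(h, x),) for p, x in zip(paths, xs)]
            scored.append((level_entropy(new_paths, labels), h, new_paths))
        best_e, best_h, best_paths = min(scored, key=lambda t: t[0])
        if best_e >= previous:
            break             # no further decrease: proceed to step 4
        layer.append(best_h)
        paths, previous = best_paths, best_e
    return layer, previous

# Hypothetical toy problem: class 1 only where both coordinates are large.
xs = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 0, 1]
candidates = [((1.0, 0.0), -0.5), ((0.0, 1.0), -0.5)]
layer, final_entropy = grow_layer(xs, labels, candidates)
```

For this toy problem the first node lowers the entropy from about 0.81 to 0.5 bit and the second node reduces it to zero, so the layer stops growing at two nodes.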


The CID3 algorithm was designed to generate a multiple-layer network functioning like a single
Adaline node, which was defined as a super-Adaline [8]. To solve multiple-category classification problems,
one can easily build a network [27] consisting of many such super-Adalines. After a hidden layer is
generated by the CID3 algorithm, the outputs from all the generated hidden layers, together with the
original inputs, are used to generate a new hidden layer. The use of the information from both the original
training data and the outputs from the previously generated hidden layers allows a learning process to
converge faster because of the increased dimensionality of training data [27]. The connections between
non-adjacent layers are called shortcuts. Feedforward networks without shortcuts, like backpropagation,
can be seen as a special case of such fully-connected networks with shortcuts.

From the machine learning point of view, a decision tree corresponds to a hidden layer of a neural 
network. If a correct classification of training examples is obtained, the corresponding entropy is reduced 
to zero. The learning which uses the knowledge from both original training examples and the outputs 
from hidden layers is actually a generalization process. 


RESULTS AND DISCUSSION 

In order to demonstrate the learning capability of the CID3 algorithm, it was applied to two 
problems. First, the CID3 algorithm was applied for recognition of defects found in manufactured glass 
[28] and compared with a standard backpropagation algorithm. 

Several types of defects found in float glass ribbon can be grouped into two categories: actual 
defects, and surface anomalies. The latter are caused by water droplets or other airborne debris. The 
anomalies are detected as defects, and the section of glass containing them must be discarded, resulting in 
a loss of otherwise useable glass. Common defects found in float glass ribbon [29] are described as
follows: 

True defects - permanent structures that degrade the homogeneity and optical quality of the glass. 

Bubble - a round or elongated gaseous inclusion within the glass, which may be open at top or 
bottom surface. 

Stone - a crystalline or amorphous inclusion within the glass, which may be opaque or slightly 
translucent. 

Tin drop - a depression on the surface caused by a drop of molten tin adhering to the glass surface 
during forming; the solidified tin drop remains in the depression. 

Surface anomalies - nonrejectable, temporary marks or spots on the glass surface.

Water droplet - a more or less hemispherical drop of liquid water, which may occur on either surface.

Water spot - mineral residue from a dried drop of water, which may occur on either surface. 

A laser scanner system was used for the detection of defects in newly manufactured float glass 
ribbon. A rotating mirror scans a focused laser beam through the continually moving glass ribbon onto a 
sensor device. This device has two separate sensor arrays, one of which detects beam absorption, while 
the other detects beam refraction. The sensors create an electronic signal which varies voltage amplitude 
in proportion to the amount of beam intensity that is absorbed or refracted. The scanner traverses the
moving glass ribbon with a focused laser beam 2400 times per second. This beam passes through the 
glass and is modified by any defects before striking the receiver. The receiver converts the incident light 
energy into a proportional electrical signal, which is the subject of our analysis. 

A total of 293 images of defects were obtained [29]. Training examples consisted of 205 images 
selected randomly, while the remaining 88 were used as test examples. The sizes of the images obtained 
by the imaging system varied in proportion to the size of the actual defect. Typical images of a bubble 
and a water droplet (determined from several scans) are shown in figures 4(a) and 4(b). Images ranged in 
size from 30 by 20 pixels to 250 by 200 pixels. Because of this, preprocessing was done to normalize the 
images before they could be used to train a neural network. The preprocessing method [28] scaled the 
images to a 10 by 10 pixel frame without changing the aspect ratios or the image intensities. The scaled 
images were placed in the center of the 10 by 10 pixel frame as shown in figures 5(a) and 5(b). The 10 
rows of an image, each 10 pixels in size, were then arranged to form a single vector. 
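The normalization step can be sketched as follows. The method of [28] is not reproduced here: nearest-neighbour sampling is an assumption (chosen because it leaves pixel intensities unchanged), as is the zero background of the frame, and the function names are illustrative.

```python
def scale_to_frame(image, frame=10):
    """Scale a 2-D list of intensities into a centred frame x frame grid, keeping aspect ratio."""
    rows, cols = len(image), len(image[0])
    scale = frame / max(rows, cols)            # same factor for both axes
    new_rows = max(1, round(rows * scale))
    new_cols = max(1, round(cols * scale))
    top = (frame - new_rows) // 2              # offsets that centre the scaled image
    left = (frame - new_cols) // 2
    out = [[0] * frame for _ in range(frame)]  # assumption: empty frame pixels are 0
    for r in range(new_rows):
        for c in range(new_cols):
            src_r = min(rows - 1, int(r / scale))
            src_c = min(cols - 1, int(c / scale))
            out[top + r][left + c] = image[src_r][src_c]
    return out

def to_vector(framed):
    """Concatenate the rows of the frame into a single input vector."""
    return [pixel for row in framed for pixel in row]

# Hypothetical 30-by-20 image of constant intensity 7.
image = [[7] * 20 for _ in range(30)]
framed = scale_to_frame(image)
vector = to_vector(framed)
```

The resulting vector always has 100 components, matching the 100 input nodes of the architectures listed in table I.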

For the purpose of distinguishing true defects (stone, bubble, or tin drop) in the glass from surface 
anomalies (water, water spot), the 205 training examples were divided into two groups representing 
defects and surface anomalies. The neural network was then trained with this data. The 88 test examples 
were then applied to the trained network to let it classify them into two categories. The correct recognition
rates for the CID3 and backpropagation [29] networks are listed in table I. The results indicate that
for two-category classification, CID3 and backpropagation gave very high correct recognition rates for the
test examples. The normalized CPU times required to train the networks are shown in table II. Training
time required for backpropagation (615 minutes) was much longer than that for the CID3 algorithm (161 minutes).

With backpropagation, the number of hidden layers and the number of nodes in each layer have 
to be determined. An inadequate number of layers or nodes might prevent convergence during training. 
An excessive number of nodes would result in a longer training time. The CID3 algorithm does not 
require the network architecture to be specified a priori. Based on the information entropy function, the
algorithm adds the necessary number of layers and nodes to correctly recognize all the input-output pairs
in the training data. The CID3 algorithm may be useful in situations where the networks are to be generated
automatically and in real time. There may also be situations where there is a time constraint on the
training time. Under these circumstances, the choice of the CID3 network would be appropriate.

Next, the CID3 algorithm was tested on difficult, non-linearly separable data [8]. The problem
was to distinguish two spirals [8, 30]. The two sets of spiral data consisted of 192 points, with 96 points
for each spiral. One spiral was generated as a reflection of the other, namely $\langle x_{1}, y_{1} \rangle = \langle -x_{2}, -y_{2} \rangle$, which
made the problem not linearly separable. The formulas used to generate the spirals are given below.

$$\rho = a\theta, \quad a = 10.0, \quad 0 < \theta \,\ldots\, \pi$$

spiral 1:

$$x_{1} = \rho \cos(\theta), \quad y_{1} = \rho \sin(\theta)$$

spiral 2:

$$x_{2} = -\rho \cos(\theta), \quad y_{2} = -\rho \sin(\theta)$$
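The spiral formulas can be turned into a small data generator as sketched below. The upper limit of the angle and the spacing of the points are not recoverable from the text, so `theta_max` and the uniform spacing in the angle are hypothetical choices; only the value of `a`, the number of points, and the reflection property come from the description above.

```python
import math

def two_spirals(points_per_spiral=96, a=10.0, theta_max=math.pi):
    """Generate two spirals, the second being the point reflection of the first."""
    spiral_1, spiral_2 = [], []
    for i in range(1, points_per_spiral + 1):
        theta = theta_max * i / points_per_spiral   # assumption: uniform steps in theta
        rho = a * theta
        x, y = rho * math.cos(theta), rho * math.sin(theta)
        spiral_1.append((x, y))
        spiral_2.append((-x, -y))
    return spiral_1, spiral_2

spiral_1, spiral_2 = two_spirals()
```

Each call yields 96 points per spiral, 192 in total, with every point of the second spiral equal to the negated coordinates of the corresponding point of the first.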


The generated neural network architecture is shown in figure 6. Connections to the node in the
second hidden layer are shown in detail, with connections to other layers shown by thick arrows. While
generating a hidden layer, the corresponding decision tree is also recorded in order to specify a set of
decision rules.





Comparison with other machine learning algorithms [5, 31, 32] that describe a concept by generating
rectangular decision regions reveals the advantage of the CID3 algorithm: it generates very
concise descriptions. This contrasts with other machine learning algorithms which would generate many
decision rules specifying numerous small rectangular regions for the two-spiral problem.

The obtained neural network architecture with the learned weights was applied to the spiral test
data consisting of 150 by 150 pixels, specified in terms of $x_{1}$ and $x_{2}$ coordinates, that cover a square area
of [-15 < $x_{1}$ < 15, -15 < $x_{2}$ < 15]. The result is shown in figure 7(d). The white region represents spiral
#1, and the black region represents spiral #2. Since at a hidden layer training examples are mapped into
an image space by CID3, one may apply a machine learning algorithm to the output of a hidden layer and
generate decision rules. The study of combining the CID3 algorithm and a machine learning algorithm
called CLILP2 [9] was reported in [33]. Here we repeat the results in table III, and show the discriminating
power of each network in figure 7.

The CLILP2 algorithm generates decision rules from the already extracted features much faster
than CID3. This results in fast generation of a simple neural network architecture. No significant difference
in discrimination ability was observed by analyzing the output images. This means that in the search
for the optimal architecture, one may concentrate on the training time and the complexity of the network
alone. It is easy to notice that Net 4 corresponds to the architecture shown in figure 6.

In order to demonstrate the performance of the fuzzy entropy measure, the CID3 was again 
applied to the spiral data using fuzzy entropy. Let us note here that learning to distinguish the two spirals 
is a very difficult task for backpropagation networks. This failure in training backpropagation neural 
networks was reported in [30] and was also confirmed by [33]. Actually, the CID3 algorithm can be seen 
as superior to work reported in [30], since the latter's architecture was obtained by using a trial-and-error 
method. 


The fuzzy version of the CID3 algorithm generates the same architecture as the one generated by 
the crisp CID3. The nodes within the hidden layer are generated until the fuzzy entropy is reduced to 
zero. The crisp pseudoentropy measure accomplishes the same task quite well. However, remarkable
progress in terms of convergence time is achieved by using a fuzzy entropy measure with generalized
Dombi operations.


CONCLUSIONS 


CID3 self-generates a neural network architecture without the need to use a trial-and-error
method to find an “optimal” architecture required by backpropagation-type networks. As a trade-off
between the effort used for training and the quality of results, the CID3 algorithm seems to be
competitive.

Unlike backpropagation, where correct classification of training examples is achieved only at the
output layer, training examples are correctly recognized by CID3 at a hidden layer for which the
information entropy is for the first time reduced to zero. In the process of generating a hidden layer by
CID3, it is easy to specify the corresponding decision rules which describe the class memberships of the
training examples [33].



The CID3 lends to machine learning algorithms its capability of working on continuous data and 
its immunity to noise. The CID3 algorithm helps in generalizing knowledge. The output of the last layer 
specifies the most general rule, and the outputs of a layer closer to the input layer specify more specific 
rules. 


In conclusion, we have shown the advantages of the CID3 algorithm by illustrating the impact of 
a machine learning algorithm on the neural network algorithm in terms of what one can contribute to the 
other. Two alternative ways of calculating the entropy, crisp and fuzzy, were used. The fuzzy entropy 
method showed better performance in terms of convergence time.

TABLE I. - RECOGNITION RATES FOR CID3 AND BACKPROPAGATION NETWORKS

Method                         True defect recognitions   Anomaly recognitions   Total
CID3 *(100:7:6:6:1)            …                          35/36 (97.22%)         85/88 (96.59%)
Backpropagation *(100:20:1)    51/52 (98.07%)             34/36 (94.44%)         85/88 (96.59%)

*Network architecture - the number of nodes in each layer.


TABLE II. - CPU TIMES TO TRAIN NETWORKS FOR TWO-CATEGORY CLASSIFICATION

Method             Normalized CPU time, minutes
CID3               161
Backpropagation    615


TABLE III. - TRAINING TIMES AND ARCHITECTURE PARAMETERS OF FOUR NETWORKS

Neural    CPU time,   Number of       Total number   Number of nodes at layer
network   minutes     hidden layers   of nodes       1         2           3           4           5           6
Net 1     19.57       2               47             24 CID3   22 CLILP2   1 CLILP2    -           -           -
Net 2     25.20       3               43             3 CID3    15 CID3     24 CLILP2   1 CLILP2    -           -
Net 3     36.90       4               31             3 CID3    15 CID3     5 CID3      7 CLILP2    1 CLILP2    -
Net 4     58.07       5               31             3 CID3    15 CID3     5 CID3      4 CID3      3 CID3      1 CID3

Figure 1. - Seventeen training examples belonging to class “+” and class “-”.

Figure 2. - Decision regions specified by hyperplanes.

Adaline #1: Entropy$_{1}$ = 0.861
Adaline #2: Entropy$_{2}$ = 0.567
Adaline #3: Entropy$_{3}$ = 0.0

$$\mathrm{Entropy}_{1} = -\frac{1}{17}\left[\left(1 \times \log_{2}\frac{1}{5} + 4 \times \log_{2}\frac{4}{5}\right) + \left(4 \times \log_{2}\frac{4}{12} + 8 \times \log_{2}\frac{8}{12}\right)\right] = 0.861 \text{ bit}$$

$$\mathrm{Entropy}_{2} = -\frac{1}{17}\left[(0 + 0) + (0 + 0) + \left(1 \times \log_{2}\frac{1}{3} + 2 \times \log_{2}\frac{2}{3}\right) + \left(7 \times \log_{2}\frac{7}{9} + 2 \times \log_{2}\frac{2}{9}\right)\right] = 0.567 \text{ bit}$$

$$\mathrm{Entropy}_{3} = -\frac{1}{17}\left[(0 + 0) + (0 + 0) + (0 + 0) + (0 + 0)\right] = 0.0 \text{ bit}$$

Figure 3. - Hidden layer corresponding to a decision tree and entropies calculated by using equation (3).

Figure 4. - Amplitude outlines of an air bubble and a water droplet. (a) Air bubble, front view and side view. (b) Water droplet, front view and side view.

Figure 5. - Scaled images of an air bubble and a water droplet. (a) Air bubble. (b) Water droplet.

Figure 7. - Output images of four networks. (a) Net 1. (b) Net 2. (c) Net 3. (d) Net 4.